Add the SQLStorm query suite to vortex-bench (nightly-only) #8165
Draft
mprammer wants to merge 22 commits into
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Implements Tasks 2 & 3 of the SQLStorm benchmark.
- sqlstorm/sqlstorm_benchmark.rs: SqlstormBenchmark implementing Benchmark,
parameterized by SqlstormOrigin. Mirrors TpcDsBenchmark; TPC-H/DS origins
reuse canonical SF=1 paths; StackOverflow/JOB get sqlstorm-<origin> dirs.
- sqlstorm/data.rs: table_names() as single source of truth per origin;
table_specs() delegates here; async data-gen stubs bail for so/job.
- sqlstorm/mod.rs: re-enable sqlstorm_benchmark module and re-export;
add FromStr impl and from_name() helper to SqlstormOrigin.
- datasets/mod.rs: BenchmarkDataset::Sqlstorm { origin } variant with
name(), Display, and tables() arms delegating to data::table_names.
- lib.rs: BenchmarkArg::Sqlstorm, imports, create_benchmark arm reading
--opt origin=<name> (default TpcH).
- v3.rs: benchmark_dataset_dims arm for Sqlstorm (origin -> dataset_variant)
and matching test case.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
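For illustration, the `FromStr` route behind `--opt origin=<name>` could look roughly like the sketch below. It is not the PR's code: the variant names, the accepted spellings (`tpch`, `tpcds`, `stackoverflow`, `job`), the `String` error type, and the `origin_from_opt` helper are assumptions. Only the default of TpcH and the existence of a `FromStr` impl come from the commit message.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the PR's `SqlstormOrigin`; variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum SqlstormOrigin {
    #[default]
    TpcH,
    TpcDs,
    StackOverflow,
    Job,
}

impl FromStr for SqlstormOrigin {
    type Err = String;

    /// Parse an origin name case-insensitively, ignoring surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "tpch" => Ok(Self::TpcH),
            "tpcds" => Ok(Self::TpcDs),
            "stackoverflow" => Ok(Self::StackOverflow),
            "job" => Ok(Self::Job),
            other => Err(format!("unknown SQLStorm origin: {other}")),
        }
    }
}

/// Hypothetical helper: resolve an optional `origin` opt, falling back to the
/// default (TpcH) when the opt is absent.
fn origin_from_opt(value: Option<&str>) -> Result<SqlstormOrigin, String> {
    match value {
        Some(raw) => raw.parse(),
        None => Ok(SqlstormOrigin::default()),
    }
}
```

Under these assumptions, an absent opt resolves to `TpcH`, and an unknown name is an error rather than a silent fallback.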
Replaces the stub in `sqlstorm/data.rs` with a full async `generate_stackoverflow` implementation that:
- Downloads the upstream schema DDL and ~1 GB gzip tarball from `db.in.tum.de/~schmidt/data/` using the shared `download_data` helper (idempotent, progress bar, retry).
- Shells out to `tar -xzf` to extract the 13 camelCase CSV files.
- Locates the CSV directory (flat or single-subdirectory archive layouts).
- Builds and runs a single DuckDB script that reads the schema DDL, COPYs each headerless CSV into a typed table, then COPYs each table to a Parquet shard with all column names lowercased, mirroring the Appian benchmark's identifier-normalization approach so DataFusion's `enable_ident_normalization=true` resolves queries correctly.
- Guards idempotency: skips all work if all 13 `parquet/*.parquet` shards are present.
- Only runs for `file://` data URLs; remote data directories are assumed to already contain the shards.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: mprammer <martin@spiraldb.com>
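The "flat or single-subdirectory" lookup described in this commit can be sketched with the standard library alone. This is an assumed reading of the behaviour, not the PR's implementation: the name `find_csv_dir`, the `.csv` extension check, and returning `None` for any other layout are all hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if `dir` directly contains at least one `*.csv` file.
fn has_csv(dir: &Path) -> io::Result<bool> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() && path.extension().and_then(|e| e.to_str()) == Some("csv") {
            return Ok(true);
        }
    }
    Ok(false)
}

/// Hypothetical helper: find the directory holding the extracted CSVs.
/// Accepts a flat layout (CSVs directly in `base`) or an archive that wraps
/// them in exactly one subdirectory; anything else yields `None`.
fn find_csv_dir(base: &Path) -> io::Result<Option<PathBuf>> {
    if has_csv(base)? {
        return Ok(Some(base.to_path_buf()));
    }
    let mut subdirs = Vec::new();
    for entry in fs::read_dir(base)? {
        let path = entry?.path();
        if path.is_dir() {
            subdirs.push(path);
        }
    }
    if subdirs.len() == 1 && has_csv(&subdirs[0])? {
        return Ok(subdirs.pop());
    }
    Ok(None)
}
```

Returning `Option` keeps "archive extracted but layout not recognised" distinct from an I/O failure, so the caller can report it as such.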
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Selected from SQLStorm v1.0 (pinned SHA b3bb0b9) by running candidate queries through both DuckDB and DataFusion 53 over the origin Parquet, keeping only those that execute on both (refill-on-failure). Provenance in queries.csv. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Adds 4 per-origin sqlstorm entries to the nightly matrix and an additive --opt origin passthrough in the reusable sql-benchmarks workflow. Not added to the per-PR default matrix. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Selected from SQLStorm v1.0 (pinned SHA b3bb0b9). Each vendored query runs on BOTH DuckDB (CLI) and DataFusion (the real datafusion-bench harness) within SQLStorm's ~10s per-query budget. DataFusion is checked via the harness, not datafusion-cli, whose extra subquery decorrelation accepted EXISTS queries the harness cannot physically plan. queries.csv is a complete log of every tested (query, engine): status is works/error/timeout/crash with the failure reason, so future workers do not re-test or re-add known-bad queries as the corpus scales. 500 works on both engines; the rest are recorded failures (e.g. 26 DataFusion timeouts >10s, plus unsupported-SQL errors). DuckDB is tried first; DataFusion is only tried when DuckDB passes, to bound selection runtime. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Drop dead row_counts.rs placeholder (returned None for every origin; SQLStorm query IDs are too sparse for a Vec<usize> indexed by query_idx, and the Benchmark trait default of None already covers this). Drop SqlstormOrigin::all() and the clap::ValueEnum derive on SqlstormOrigin - both unused. The CLI exposes the suite via BenchmarkArg::Sqlstorm; the origin itself is read via Opts::get_as::<SqlstormOrigin> which goes through FromStr. Drop the now-redundant expected_row_counts override on SqlstormBenchmark and update the stale doc comments that referenced a stackoverflow.dbschema.json (the upstream stackoverflow_schema.sql DDL is the actual source) and a "later task" for JOB tables (now inlined as JOB_DDL). Swap tpch/5862.sql for tpch/1261.sql: 5862 surfaces a DuckDB <-> Vortex bridge bug that the original DuckDB-CLI + DataFusion-harness selection could not see. Its queries.csv rows are removed (not annotated) so a future full-coverage selection pass will re-evaluate 5862 from scratch once the bridge bug is addressed -- queries.csv is an add-this / known-to-fail gate for future selection passes, not a permanent audit log, and 5862 is not "known-to-fail" in the SQL-compat sense. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
…_csv_dir StackOverflow DDL was downloaded from the upstream URL at data-gen time and then filtered to strip ALTER TABLE FOREIGN KEY statements that DuckDB rejects. Inline it as STACKOVERFLOW_DDL alongside the existing JOB_DDL: drops one network dep, the line-based FK filter, the SCHEMA_URL const, and the anyhow::Result return on build_duckdb_script (no file read remaining). Column types and NOT NULL constraints are preserved verbatim; inline references / primary key declarations are stripped since they are not enforced by COPY. JOB extraction was passing base_dir directly to build_job_duckdb_script, assuming the upstream imdb.tzst always lays CSVs flat. Route it through locate_csv_dir (which generate_stackoverflow already uses) so both origins handle a possible wrapping subdirectory consistently. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: mprammer <martin@spiraldb.com>
Collapse the near-identical generate_stackoverflow/generate_job paths into a single generate_origin(&OriginData) driver parameterized by a per-origin recipe. Bake lowercase column names into each DDL so a plain `SELECT *` exports lowercase Parquet, dropping the TABLE_COLUMNS/build_projection machinery (~90 lines). Unify tar.gz/tzst extraction behind extract_archive + an Archive enum, and switch data-gen idempotency to a `.success` sentinel. Move StackOverflow/JOB data under a shared `sqlstorm/<origin>/` dir (mirroring the vendored-query layout) and reuse the crate `DEFAULT_SCALE_FACTOR` const instead of a literal "1.0". Add tests guarding the two table-name invariants that otherwise surface only at nightly data-gen time: tables<->table_names() (registration vs gen) and DDL CREATE TABLE names <-> COPY tables. Also add data-url layout / remote-override tests for the per-origin path resolution. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
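As a rough illustration of sentinel-based idempotency, the sketch below writes a `.success` marker only after generation finishes, so an interrupted run is retried rather than mistaken for a complete one. The function name `generate_once`, its signature, and the empty marker file are assumptions; only the `.success` sentinel itself is taken from the commit message.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Marker file name; the commit mentions a `.success` sentinel.
const SENTINEL: &str = ".success";

/// Hypothetical driver: run `generate` unless a previous run already completed.
/// Returns `Ok(true)` if generation ran, `Ok(false)` if it was skipped.
fn generate_once<F>(out_dir: &Path, generate: F) -> io::Result<bool>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let marker = out_dir.join(SENTINEL);
    if marker.exists() {
        return Ok(false);
    }
    fs::create_dir_all(out_dir)?;
    // If this fails, the early return via `?` leaves no marker behind,
    // so the next invocation regenerates from scratch.
    generate(out_dir)?;
    fs::write(&marker, b"")?;
    Ok(true)
}
```

Compared with checking that every expected shard exists, a single marker does not need to know the table list, at the cost of trusting that whatever wrote the marker produced complete output.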
Document layout, the by-hand refresh procedure (verify candidates against the bench's own DataFusion SessionContext, not datafusion-cli), the pinned upstream SHA, and how to run each origin locally. Remove queries.csv: the current phase is performance over the vendored 500-query sample, not SQL-compatibility completeness, so the machine-readable pass/fail log is no longer carried in-tree; a future full-corpus selection pass re-evaluates candidates from scratch. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
The earlier long-runner cleanup picked the lowest-id passing candidate (refill `sort -n`), biasing the replacements toward low ids (job 5..156, stackoverflow/tpcds both id 3). Re-pick those slots with a seeded-random shuffle of the corpus plus a short-runtime cap (<=2.0s/engine on parquet+vortex, matching the kept set's median 0.40s / max 1.63s envelope): 30 job ids now span 111..34217, stackoverflow id 3 -> 33961, tpcds id 3 -> 7155. Each new query was confirmed to pass DuckDB and DataFusion; all four origins were re-validated end-to-end strict (125/125 per origin, both engines, parquet+vortex). Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
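The re-pick described here (seeded shuffle of the corpus, then keep the first candidates under a runtime cap) can be sketched without external crates. Everything below is illustrative: the commit does not say which generator or tooling was used, so SplitMix64, the Fisher-Yates loop, the `runtime_secs` callback, and the function names are assumptions standing in for whatever the selection script actually did.

```rust
/// Small deterministic PRNG (SplitMix64), used here only so the sketch needs
/// no external crate; the PR's actual generator is not stated.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle driven by the seeded generator. The modulo introduces
/// a negligible bias, which is acceptable for picking benchmark candidates.
fn seeded_shuffle(ids: &mut [u32], seed: u64) {
    let mut rng = SplitMix64(seed);
    for i in (1..ids.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        ids.swap(i, j);
    }
}

/// Hypothetical selection: walk the shuffled corpus and keep the first `want`
/// ids whose measured runtime is at or under `max_secs`. `runtime_secs`
/// returns `None` for a query that failed on either engine.
fn pick_replacements(
    corpus: &[u32],
    seed: u64,
    want: usize,
    max_secs: f64,
    runtime_secs: impl Fn(u32) -> Option<f64>,
) -> Vec<u32> {
    let mut order = corpus.to_vec();
    seeded_shuffle(&mut order, seed);
    order
        .into_iter()
        .filter(|&id| matches!(runtime_secs(id), Some(secs) if secs <= max_secs))
        .take(want)
        .collect()
}
```

Because the order depends only on the seed and the corpus, rerunning with the same inputs yields the same picks, while ids are no longer biased toward the low end the way a `sort -n` refill was.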
Add a "Data size (fixed scale)" section to the sqlstorm README spelling out the fixed per-origin sizes (TPC-H/TPC-DS at SF1, StackOverflow `dba` ~1 GB, JOB the fixed IMDB snapshot), that `--opt scale-factor` is silently ignored, and that this mirrors upstream (OLAPBench selects size per origin; there is no uniform scale knob). Add a matching comment at the benchmark factory noting the opt is intentionally not read. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
Add a fixed SQLSTORM_TPC_SCALE_FACTOR const (10.0) driving the TPC-H/TPC-DS data paths and delegated generation, replacing the crate DEFAULT_SCALE_FACTOR (now reverted to private, since sqlstorm no longer imports it). SQLStorm still has no user-facing scale factor; this just moves the fixed point to SF10. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
Point the StackOverflow OriginData at stackoverflow_math.tar.gz (~12 GB) instead of the dba (~1 GB) tarball; identical 13-table schema, more rows. Add a guard test pinning the tier, and fix the now-stale factory comment. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
…ath CSVs The `math` tier's large free-text columns (Posts.body, PostHistory.text, …) contain rows whose embedded quotes don't strictly comply with RFC-4180, which makes DuckDB's CSV dialect sniffer fail outright (it could parse the smaller `dba` tier). Pin the dialect explicitly and parse leniently via extra_copy_opts: `AUTO_DETECT false, QUOTE '"', ESCAPE '"', strict_mode false, ignore_errors true`. Verified end-to-end: all 13 shards generate, 39.6M rows total. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
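To make the pinned dialect concrete, here is a sketch of how a per-origin `extra_copy_opts` string might be spliced into a generated DuckDB `COPY ... FROM` statement. The option string is the one quoted in the commit message; the helper names, the `FORMAT csv, HEADER false` prefix, and the exact statement shape are assumptions about the script builder, not its actual code.

```rust
/// Lenient CSV options quoted in the commit message.
const LENIENT_CSV_OPTS: &str =
    r#"AUTO_DETECT false, QUOTE '"', ESCAPE '"', strict_mode false, ignore_errors true"#;

/// Quote a value as a SQL string literal, doubling embedded single quotes.
fn sql_string_literal(s: &str) -> String {
    format!("'{}'", s.replace('\'', "''"))
}

/// Hypothetical builder for one `COPY <table> FROM <csv>` statement.
/// `extra_copy_opts` is appended verbatim after the base options.
fn copy_csv_statement(table: &str, csv_path: &str, extra_copy_opts: &str) -> String {
    let mut opts = String::from("FORMAT csv, HEADER false");
    if !extra_copy_opts.is_empty() {
        opts.push_str(", ");
        opts.push_str(extra_copy_opts);
    }
    format!("COPY {table} FROM {} ({opts});", sql_string_literal(csv_path))
}
```

One consequence worth keeping in mind: with `ignore_errors true`, rows that still fail to parse are skipped rather than aborting the load, so the total row count (39.6M here) is the practical check that nothing substantial was dropped.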
…w scale At SF10 (tpch/tpcds) and the StackOverflow math tier, some queries curated to be short at SF1/dba now exceed budget. Re-curate each origin: drop queries that fail or exceed ~5s/engine (parquet+vortex, 1 iter) at the new scale, and refill to 125 with seeded-random short-at-scale candidates from the pinned corpus, verified on both DuckDB and DataFusion. Swaps: tpch 29, tpcds 4, stackoverflow 15. JOB unchanged. ~80% of each origin's prior set survives, so the query mix stays close. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
Update the README for the new fixed scales: TPC-H/TPC-DS now generate dedicated SF10 datasets (no longer reuse the SF1 data), StackOverflow uses the math tier. Refresh the Data size table with measured row counts (so 40M / job 74M / tpch 87M / tpcds 192M — within one order of magnitude), note the scale is set in code (SQLSTORM_TPC_SCALE_FACTOR / the STACKOVERFLOW recipe) not a runtime flag, and that the refresh procedure verifies candidates short (<=5s/engine) at scale. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
Resolve conflicts in the benchmark registration touchpoints where develop added the Appian benchmark in the same slots this branch added SQLStorm: keep both in the BenchmarkDataset enum + name/Display/tables arms (datasets/mod.rs), the BenchmarkArg enum + create_benchmark match (lib.rs), the v3 dataset-dims mapping, and the orchestrator README benchmark list. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
The public module doc and `table_names` doc linked to `OriginData::tables`, a private field, which `cargo doc` rejects under `-D warnings` (rustdoc::private_intra_doc_links). De-link both to plain code spans; the field-level docs that link the same private items are private-context and stay. Signed-off-by: mprammer <martin@spiraldb.com> Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Adds the SQLStorm query suite to vortex-bench as a nightly-only benchmark. SQLStorm is an LLM-generated SQL stress corpus; this vendors a confirmed-working sample of 500 queries — 125 each across four schemas (stackoverflow, job, tpch, tpcds) — and runs them Parquet-vs-Vortex on both DataFusion and DuckDB, structured the same way as the existing TPC-DS benchmark so it needs no runner changes. The point is coverage: these queries exercise joins, subqueries, and aggregation shapes that TPC-H and TPC-DS don't, so they stress Vortex's scan and compute breadth cheaply.
It runs only in the overnight `nightly-bench.yml` matrix, never the per-PR path. The four datasets are sized to sit within one order of magnitude of each other (40M–192M rows): TPC-H and TPC-DS generate their own data at scale factor 10, StackOverflow downloads the ~12 GB "math" tier, and JOB downloads the IMDB snapshot; the two non-TPC schemas convert to Parquet once and cache behind a `.success` marker. There is no scale-factor knob: each schema runs at a single fixed size set in code, and the vendored queries are curated to pass both engines and stay under ~5 s each at that scale, keeping the nightly query wall around 22 minutes.
Query sample and the fuzzer to come
The 500 vendored queries are a deliberately small, fixed, hand-verified slice of SQLStorm's ~62k-query corpus, pinned at a known SHA and curated to run deterministically and cheaply enough for nightly. That fixed shape is the near-term step on purpose. The longer-term goal is a SQLStorm fuzzer that samples or regenerates from the full corpus on each run to surface Vortex-vs-Parquet correctness and performance divergences across a far wider query space than a frozen sample can reach. That fuzzer is explicitly not this PR — this lands only the fixed sample plus the data-acquisition and harness plumbing it needs, and is the foundation the fuzzer would build on later.
Testing
The 500-query suite runs only in nightly, so per-PR CI doesn't exercise it; it was validated manually by running all four schemas strict (the harness aborts on the first failing query) on both engines across Parquet and Vortex — 125/125 each. The added unit tests run in the normal workspace test job: data-directory resolution per schema, the table-name drift guards (registration vs data-gen, and DDL vs COPY), and the StackOverflow tier pin.
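A DDL-vs-COPY drift guard of the kind mentioned above could be approximated as follows. This is a sketch under assumptions: `ddl_table_names` is a hypothetical helper, the parsing is a naive scan for `CREATE TABLE <name>` (it does not handle `IF NOT EXISTS`, quoted identifiers, or comments), and the DDL in the usage example is made up rather than taken from the PR.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: collect the (lowercased) table names declared by
/// `CREATE TABLE <name>` statements in a DDL string. Deliberately naive.
fn ddl_table_names(ddl: &str) -> BTreeSet<String> {
    const KEYWORD: &str = "create table";
    let lower = ddl.to_ascii_lowercase();
    let mut names = BTreeSet::new();
    let mut rest = lower.as_str();
    while let Some(pos) = rest.find(KEYWORD) {
        rest = &rest[pos + KEYWORD.len()..];
        // The identifier is the run of [a-z0-9_] after the keyword.
        let name: String = rest
            .trim_start()
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
            .collect();
        if !name.is_empty() {
            names.insert(name);
        }
    }
    names
}
```

A guard test would then compare this set with the list of tables the data-gen step COPYs, for example `assert_eq!(ddl_table_names(DDL), copy_tables)`, so a table added to one side but not the other fails in the normal workspace test job instead of at nightly data-gen time.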
🤖 Generated with Claude Code